Introducing cl-data-frames

A little bit of background

The key feature of the languages commonly used for data science and statistical analysis is the presence of data frames. A data frame is essentially just a way of presenting data in the form of a table, plus a set of operations allowing easy manipulation of the content. This probably does not sound all that thrilling, but once you consider the domain you quickly realize that in practice almost every frequently performed task is best done with a data frame. Don't get me wrong: it can still be done without one, but by sticking to just the fundamental data structures you are simply wasting your time. Let's, for instance, consider typical R code written with the help of the ubiquitous dplyr package.

We have a data frame containing a data set with human body measurements: weight and height, under the columns "weight" and "height". We decide to add a new column with the body mass index (BMI for short, which is also what the new column will be called). To do so, we need just a single line of code (assuming that the data set uses metric units).

transform(body_data_frame, bmi = weight / height^2)

You may be wondering: how does this even work? There is no lambda form to be found, and the result of calling transform is not stored anywhere. Well, unless you know R already! dplyr makes good use of two R features: first-class expressions and first-class environments. First-class expressions are somewhat like first-class functions. Just like some languages allow you to pass a function as an argument to another function, R allows you to pass a non-evaluated expression. The difference is that while functions are embedded in their lexical environment, which is augmented during the call by creating new bindings according to the function's lambda list, expressions are not called; they are just evaluated. R allows you to pass an explicit environment in which this evaluation happens, as well as facilities to construct environments programmatically at run time. This is even more powerful than the automatic augmentation of the lexical environment during a function call!
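
Since this is a Lisp blog, here is a loose Common Lisp analogy of that mechanism. This is not how R (or cl-data-frames) is implemented; it is just a minimal sketch of "evaluate a passed, non-evaluated expression in an environment constructed at run time", with EVALUATE-IN-ROW being a made-up helper:

(defun evaluate-in-row (expression column-names row-values)
  ;; Build a LET form that binds each column name to its value in the
  ;; current row, then evaluate the user's expression inside it.
  (eval `(let ,(mapcar #'list column-names row-values)
           (declare (ignorable ,@column-names))
           ,expression)))

;; (evaluate-in-row '(/ weight (* height height))
;;                  '(weight height)
;;                  '(70.0 1.8))
;; => 21.604938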

Transform iterates row-wise over the whole frame, each time evaluating the passed expression in an environment containing additional bindings to the values of that row, one binding per column. After evaluation, values can be accessed, manipulated, and transferred to a new data frame. That is the key word: new. Transform does not perform destructive manipulation of the existing data frame. Instead, it will mutate the outer environment to alter the binding of body_data_frame so that it points to a newly constructed data frame (in addition to this, dplyr provides the completely non-destructive function (sic!) mutate, which does not mutate anything). This is quite an elegant (and maybe a little bit magical) solution!

Unfortunately, R is not ideal. Even after climbing the initial learning curve and getting acquainted with the essential packages, you may find yourself limited. For instance, you may ask yourself "where are the input/output streams?", and the answer would be "there are none". R is a highly specialized language and lacks many of the general-purpose features that you may take for granted. This may not seem like a practical issue at first, but eventually you will find yourself attempting to represent sets of boolean flags as a bit-set, only to realize that you cannot force any of the libraries you rely on for important tasks like clustering to work with your new data representation.

Python's pandas library is not nearly as nice to use. Spark's data frames even less so. Also, neither of those is a Lisp. Perhaps Julia is the right answer, but I am already in a deep romantic relationship with Common Lisp. And not only that: there is a lot of practical merit in using Common Lisp for all sorts of interactive data processing tasks. However, as already established, I would prefer to do so using a data frames library, and so I decided to write one that would be pleasant to use.

This blog post describes this new library with the excellent and innovative name "cl-data-frames" that I cooked up for my own use.

Design goals

What "pleasant to use" means though? Such a vague statement is not a valid mission. What are the exact things I am looking for?

  1. I work with sparse data sets. Data frames should be sparse so as not to waste memory when representing mostly-null columns.
  2. The design should be layered in such a way that it becomes possible to quickly hack in needed features. For instance, it should be possible to construct data frames in a way simple enough to integrate with other libraries. This is important because I can't even hope to support every possible data source. But if a user can, for instance, fill a data frame while iterating over a Postmodern query result with DOQUERY in just a few extra lines of code, this becomes much less of a problem (see the sketch after this list).
  3. Efficient and safe data sharing between different instances of data frames. This should allow memory usage to remain compact even when working with large data sets.
  4. The API should be simple. Ideally, most use cases should be realized with just a few functions. For instance, all row-wise operations should be performed with a single function, akin to dplyr's mutate/transform.
  5. Both destructive (but safe: I don't want to end up with clobbered data just because a condition popped up) and non-destructive operations.
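
To make the second point concrete, here is the kind of integration I have in mind. CL-DF:ADD-ROW is a hypothetical helper standing in for whatever the column-layer building protocol ends up providing, and the exact EMPTY-TABLE lambda list is assumed as well, so treat this as a sketch of the goal rather than working code:

;; Fill a fresh data frame straight from a Postmodern query.
;; CL-DF:ADD-ROW is hypothetical; the point is that the whole
;; integration should cost only these few lines.
(defun admissions-table (header)
  (let ((table (cl-df:empty-table :header header)))
    (postmodern:doquery (:select 'patient-id 'date 'diagnosis 'drg
                         :from 'admissions)
        (patient-id date diagnosis drg)
      (cl-df:add-row table patient-id date diagnosis drg))
    table))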

Decisions, decisions…

With those goals in mind, I could move forward, and instantly new questions arose. What layers should be distinguished in the library, and what concepts does each layer express? How is the data representation structured? What basic operations are supplied? Which data structures should be used?

General architecture

Cl-data-frames is composed of layers connected by sets of protocol functions, objects, classes, and variables. Those layers are, from the lowest level to the highest:

  1. Column layer.
  2. Header layer.
  3. Table layer.
  4. API layer.

Key concepts

The column layer provides the basic data structure used to represent a column, as well as the means to change its content, either directly or through an iterator. The iterator also provides the means to alter multiple columns at once; this is needed for building and modifying whole data frames from the ground up. Columns are mutable, and they use a copy-on-write mechanism to save time and (more importantly) memory when multiple column instances share part or all of their content.

Immutable headers represent the information about the columns forming the data frame. This includes the types stored in the columns, column aliases, and predicates for the content; in other words, the data schema. Many functions assume that the relevant header is implicitly passed as a dynamic (also called special) variable. This also means that calls to cl-data-frames functions are quite frequently placed within a WITH-HEADER (or WITH-TABLE) form that establishes such a binding. This may appear inconvenient; however, it maintains the separation between the header and the actual content, thereby allowing aggregation without the need to construct an actual data frame. This is extremely useful when the input file is larger than the RAM available on the machine.
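
For instance, counting rows per DRG code straight from a file could look more or less like this. I am assuming here that CL-DS.ALG:FREQUENCY accepts a :KEY argument the way GROUP-BY does, so take the exact spelling with a grain of salt:

;; No table is materialized: the CSV range is consumed row by row
;; under the header's dynamic binding, and only the aggregation
;; result is kept in memory.
(cl-df:with-header (*header*)
  (cl-ds.alg:frequency (cl-df:copy-from :csv #P"/path/to/data.csv")
                       :test 'eql
                       :key (cl-df:brr drg)))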

The same layer also establishes the concept of the current row. This is an object bound to a dynamic variable during row-wise operations, granting read and write access to the content of the currently processed data frame row. Operations on the row form an object protocol. These features allow easier integration with the cl-data-structures library by separating the content of the current row from the arguments accepted and values returned by the lambda forms passed to the cl-data-structures layer and aggregation functions. This is demonstrated in the Use cases section of this post.

The table itself is a mutable object composed of a header and a sequence of columns. The AT generic function gives access to the value in an individual cell of the table, but it is strongly advised not to loop over the table; instead, use the TRANSFORM function for row-wise changes (its IN-PLACE argument controls whether the changes are performed destructively) and a relevant function from the cl-data-structures package for aggregation (for instance, calculating the sum of all elements in a column, finding the minimum or maximum, and so on).
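
As an illustration, the BMI example from the beginning of this post would look roughly as follows. TRANSFORM and IN-PLACE are mentioned above, but the exact argument order, and the assumption that the BMI column already exists (added beforehand, e.g. with NEW-COLUMNS), are mine:

;; Row-wise, non-destructive update: a new table is returned, and
;; untouched columns share their content with the old one.
(cl-df:with-table (*body-table*)
  (cl-df:transform *body-table*
                   (cl-df:body (weight height bmi)
                     (setf bmi (/ weight (* height height))))
                   :in-place nil))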

Data structures

The sparsity requirement, combined with the need for both destructive and non-destructive operations, convinced me to base the column representation on a sparse variant of the RRB vector data structure. Sparsity was achieved by adding a bitmask to every RRB tree node to mark the occupied slots (just like in the HAMT data structure). This increases the memory requirements for each non-leaf node in the trie, but only by an extra 32 bits per node. I consider this good enough not to bother with a dense variant of the column. Luckily, I already had those data structures in cl-data-structures, but plenty of extra code was still required for an efficient implementation of the operations needed by data frames.
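
To picture the idea, a sparse node boils down to something like the sketch below; the real cl-data-structures implementation differs in the details, this is only an illustration of the bitmask trick:

(defstruct sparse-node
  ;; One bit per logical slot (32 slots per node): bit I is set when
  ;; slot I is occupied. This mask is the only per-node overhead that
  ;; sparsity adds.
  (bitmask 0 :type (unsigned-byte 32))
  ;; Only the occupied slots are stored, packed densely.
  (content #() :type simple-vector))

(defun sparse-ref (node index)
  ;; Count the occupied slots below INDEX to find the physical
  ;; position in CONTENT, exactly as in a HAMT.
  (let ((mask (sparse-node-bitmask node)))
    (if (logbitp index mask)
        (svref (sparse-node-content node)
               (logcount (ldb (byte index 0) mask)))
        :null)))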

High-level API

For the most typical tasks, the user is expected to simply stick to the high-level API and not dwell on the low-level details described above. Therefore, I've added separate packages gathering the relevant symbols from the layers below and re-exporting them along with the added functionality. This API is composed of the following symbols:

  1. EMPTY-TABLE
  2. SAMPLE
  3. NEW-COLUMNS
  4. COPY-FROM
  5. COPY-TO
  6. WITH-TABLE
  7. WITH-HEADER
  8. STANDARD-HEADER
  9. MAKE-HEADER
  10. AT
  11. COLUMN-COUNT
  12. ROW-COUNT
  13. BODY
  14. RR
  15. BRR
  16. TO-TABLE
  17. VMASK
  18. VSELECT
  19. VSTACK
  20. REPLICA
  21. REMOVE-NULLS
  22. SHOW
  23. ORDER-BY

I am not completely happy with the shape of this API right now, but at least I am not constrained by backward compatibility. It is nice to have a fresh start, but it is not easy to get everything right. Regardless, even right now it is not a horrible system to use.

Use cases

This may all seem nice, but I bet you are wondering how it helps to solve the practical problems you may encounter. Let me take you on a journey through my work and show how I use this software to make my life easier.

ICD-10 is the international coding standard for medical diagnoses. Each code describes an individual medical condition with a sometimes amusing level of detail. For instance, the code W6152XA corresponds to the description "Struck by goose, initial encounter". Codes have a hierarchical structure: the prefix W61 designates "Contact with birds", "bird" is a superset of "goose", and likewise W61 is a prefix of W6152XA. It happens that I work with a very large data set containing medical records. Patients usually have multiple diagnoses, stored as a string of comma-separated diagnosis codes. The whole data set is distributed as a CSV file with the following columns:

  1. patientid
  2. date
  3. diagnosis
  4. DRG

This schema can be easily represented as a cl-df header.

(defparameter *header* (cl-df:make-header 'cl-df:standard-header
                                          '(:alias patient_id :type integer)
                                          '(:alias date :type local-time:timestamp)
                                          '(:alias diagnosis :type t)
                                          '(:alias drg :type integer)))

We are going to split the strings representing diagnoses into individual codes during loading, and ensure that only a single string instance is kept for each distinct code. This will save a lot of memory and is quite easy to do, as demonstrated below.

;; Intern strings: always return the canonical instance of X,
;; storing X itself on the first encounter.
(let ((table (make-hash-table :test 'equal)))
  (defun unique (x)
    (multiple-value-bind (result found) (gethash x table)
      (if found
          result
          (setf (gethash x table) x)))))

;; Split a comma-separated diagnosis string into a list of interned
;; codes; :NULL (a missing value) maps to the empty list.
(defun split-and-unique (string)
  (if (eq :null string)
      '()
      (mapcar #'unique (cl-ppcre:split "," string))))

(cl-df:with-header (*header*)
  (defparameter *table*
    (serapeum:~> (cl-df:copy-from :csv #P"/path/to/data.csv")
                 (cl-df:to-table :header *header*
                                 :body (cl-df:body (diagnosis)
                                         (setf diagnosis (split-and-unique diagnosis)))))))

Once the table is constructed, analysis can be performed. In this example, we will count the frequency of diagnoses grouped by DRG and by diagnosis prefix.

(cl-df:with-table (*table*)
  (defparameter *diagnosis-frequency-by-drg-and-by-prefix*
    (serapeum:~> *table*
                 (cl-ds.alg:flatten-lists :key (cl-df:brr diagnosis))
                 (cl-ds.alg:group-by :test 'eql
                                     :key (cl-df:brr drg))
                 (cl-ds.alg:group-by :test 'equal
                                     :key (lambda (diagnosis)
                                            (serapeum:take 3 diagnosis)))
                 (cl-ds.alg:frequency :test 'equal)
                 (cl-df:to-table :columns '((:alias drg)
                                            (:alias icd-prefix)
                                             (:alias frequency-prefix))))))

Notice how CL-DF:BRR can be used to create a function extracting data from the row, while the ordinary lambda form in the second group-by operates independently on the data extracted by CL-DS.ALG:FLATTEN-LISTS. This works because iterating over a data frame with the CL-DS:ACROSS and CL-DS:TRAVERSE functions also manipulates the state established by the CL-DF:WITH-TABLE macro. Even while walking over the individual flattened diagnoses, we still have access to the original row, because it is stored in the dynamic environment. I picked this approach because it plays nicely with the algorithms already present in cl-data-structures; for example, the grouping by DRG code can be moved below the grouping by prefix (the order of columns in the to-table call would have to be changed, obviously).
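
For completeness, that reordered pipeline would look like this (the same functions as above, with the two group-by steps and the column order swapped):

(cl-df:with-table (*table*)
  (serapeum:~> *table*
               (cl-ds.alg:flatten-lists :key (cl-df:brr diagnosis))
               ;; Group by the three-character ICD prefix first...
               (cl-ds.alg:group-by :test 'equal
                                   :key (lambda (diagnosis)
                                          (serapeum:take 3 diagnosis)))
               ;; ...and only then by the DRG code of the original row.
               (cl-ds.alg:group-by :test 'eql
                                   :key (cl-df:brr drg))
               (cl-ds.alg:frequency :test 'equal)
               (cl-df:to-table :columns '((:alias icd-prefix)
                                          (:alias drg)
                                          (:alias frequency-prefix)))))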

The future

Right now I am testing the library to find anything that may prove to be an annoying aspect of the design. I'm also still fishing out bugs in both CL-DATA-STRUCTURES and CL-DATA-FRAMES, which, unfortunately, means that neither is ready for general use. I will keep you updated.